Extracts truncated UTF-8 sequence validation to BitString class by mward-sudo · Pull Request #718 · bartblast/hologram

mward-sudo · 2026-02-18T23:41:46Z

Closes #714

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

New Features
- Added robust UTF-8 validation and decoding utilities to improve text handling and truncation detection.
Refactor
- Simplified UTF-8 validation flow by centralizing sequence checks and reducing internal helper complexity.
Tests
- Added extensive UTF-8 test coverage covering 1–4 byte sequences, truncation cases, overlong/surrogate/out-of-range cases, and continuation-byte validation.

…ds parameter validation

coderabbitai · 2026-02-18T23:42:03Z

Note

Reviews paused

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds UTF‑8 decoding, validation, and truncation detection utilities to Bitstring and refactors the Erlang unicode module to use those utilities; duplicates of the new utilities appear in bitstring.mjs and tests for the utilities are added.

Changes

Cohort / File(s)	Summary
Bitstring UTF‑8 utilities `assets/js/bitstring.mjs`	Added five static methods: `decodeUtf8CodePoint(bytes,start,length)`, `isValidUtf8CodePoint(codePoint,encodingLength)`, `isValidUtf8ContinuationByte(byte)`, `isValidUtf8Sequence(bytes,start,length)`, and `isTruncatedUtf8Sequence(bytes,start)`. Note: utilities are present in two locations within the file (duplication).
Unicode module refactor `assets/js/erlang/unicode.mjs`	Removed local UTF‑8 decoding/validation helpers; `findValidUtf8Length` now returns a numeric length. Call sites updated to use `Bitstring.isValidUtf8Sequence` and `Bitstring.isTruncatedUtf8Sequence` for validation and truncation checks.
Tests `test/javascript/bitstring_test.mjs`	Added extensive tests covering decoding, continuation byte checks, code point validation, sequence validation, and truncation detection across 1–4 byte sequences and edge cases (overlong, surrogates, out‑of‑range, truncated sequences).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Extract UTF-8 sequence length detection to Bitstring class #706: Changes to Bitstring UTF‑8 handling (adds sequence length helper) that this PR builds on and extends with decoding/validation/truncation utilities.

Suggested reviewers

bartblast

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	❓ Inconclusive	The PR contains scope-related changes: UTF-8 utilities appear duplicated in bitstring.mjs, and tests duplicate validation coverage. While the core functionality aligns with `#714`, the duplications suggest incomplete refactoring.	Verify that duplicated UTF-8 methods and test cases in bitstring.mjs and bitstring_test.mjs are intentional, or consolidate them to maintain a single source of truth for UTF-8 validation logic.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: extracting UTF-8 truncated sequence validation logic to the Bitstring class, which is the primary objective of this PR.
Linked Issues check	✅ Passed	The PR successfully implements the main objectives from `#714`: introduces BitString.isTruncatedUtf8Sequence(), adds supporting UTF-8 helper methods, rationalizes findValidUtf8Length(), and updates call sites to use the new BitString methods.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

assets/js/bitstring.mjs (1)

258-259: Consider hoisting constant lookup objects out of the method bodies.

firstByteMasks (Line 259) and minValueForLength (Line 602) are re-created on every call. If these methods are called in a tight loop (e.g., scanning long byte arrays), the repeated object allocation is unnecessary overhead. Hoisting them to module-level or static class-level constants eliminates this.

♻️ Suggested change

+// Module-level constants for UTF-8 decoding
+const UTF8_FIRST_BYTE_MASKS = {2: 0x1f, 3: 0x0f, 4: 0x07};
+const UTF8_MIN_VALUE_FOR_LENGTH = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
+
 export default class Bitstring {

Then use them in the respective methods:

   static decodeUtf8CodePoint(bytes, start, length) {
     if (length === 1) return bytes[start];
-    const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-    let codePoint = bytes[start] & firstByteMasks[length];
+    let codePoint = bytes[start] & UTF8_FIRST_BYTE_MASKS[length];

   static isValidUtf8CodePoint(codePoint, encodingLength) {
-    const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-    if (codePoint < minValueForLength[encodingLength]) return false;
+    if (codePoint < UTF8_MIN_VALUE_FOR_LENGTH[encodingLength]) return false;

Also applies to: 601-602

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 258 - 259, The objects firstByteMasks
and minValueForLength are being re-created inside function bodies on every call
causing unnecessary allocations; move them to module-level constants (e.g.,
define const FIRST_BYTE_MASKS = {2:0x1f,3:0x0f,4:0x07} and const
MIN_VALUE_FOR_LENGTH = {...}) and replace the in-function usages of
firstByteMasks and minValueForLength with these hoisted constants in the
functions that reference them (search for identifiers firstByteMasks and
minValueForLength to update all occurrences).

assets/js/erlang/unicode.mjs (1)

670-672: Nit: inconsistent comment casing.

Line 670 uses lowercase "scan forward" while the other four copies (Lines 98, 309, 563, 776) use "Scan forward". Trivial, but one more reason to extract the function.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 670 - 672, The comment "scan
forward, validating each sequence" is inconsistently cased compared to other
occurrences; update this comment to "Scan forward, validating each sequence" to
match the other instances (or better, extract the repeated logic into a helper
function used by the blocks that declare let pos = 0 to remove duplicated
comments). Locate the block containing the variable pos (the scanning loop) and
either change the comment text to title case or refactor the scan logic into a
single function (e.g., scanForward/validateSequence) and replace the duplicated
comments with a single well-named function call.

test/javascript/bitstring_test.mjs (1)

5267-5344: Clarify and de‑duplicate truncated‑sequence tests.
Two 4‑byte truncation tests use identical input, and the first test’s comment mentions a continuation byte that isn’t present. Consider removing the duplicate and aligning the comment or sample bytes for clarity.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/javascript/bitstring_test.mjs` around lines 5267 - 5344, The two tests
asserting truncation for 4-byte sequences duplicate the same input ([0xf0, 0x90,
0x8d]) and the first test's comment is inconsistent; remove the duplicate test
(either the one titled "returns true for truncated 4-byte sequence with valid
continuation bytes" or "returns true for 4-byte sequence with 2 valid
continuation bytes (truncated)") and keep a single test that uses
Bitstring.isTruncatedUtf8Sequence with bytes [0xf0, 0x90, 0x8d], updating its
comment to state "only 3 bytes available (2 valid continuation bytes)" so the
test name, comment and sample bytes are aligned.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 97-112: The helper findValidUtf8Length is duplicated in five
places; move it into the Bitstring utility and call it from unicode.mjs instead
of inlining; specifically, add a method (e.g.,
Bitstring.findValidUtf8Length(bytes)) to bitstring.mjs that performs the same
loop using Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence,
export/attach it to the Bitstring class, then in unicode.mjs replace each
in-file findValidUtf8Length definition with direct calls to
Bitstring.findValidUtf8Length(bytes) from the functions characters_to_list/1,
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 258-259: The objects firstByteMasks and minValueForLength are
being re-created inside function bodies on every call causing unnecessary
allocations; move them to module-level constants (e.g., define const
FIRST_BYTE_MASKS = {2:0x1f,3:0x0f,4:0x07} and const MIN_VALUE_FOR_LENGTH =
{...}) and replace the in-function usages of firstByteMasks and
minValueForLength with these hoisted constants in the functions that reference
them (search for identifiers firstByteMasks and minValueForLength to update all
occurrences).

In `@assets/js/erlang/unicode.mjs`:
- Around line 670-672: The comment "scan forward, validating each sequence" is
inconsistently cased compared to other occurrences; update this comment to "Scan
forward, validating each sequence" to match the other instances (or better,
extract the repeated logic into a helper function used by the blocks that
declare let pos = 0 to remove duplicated comments). Locate the block containing
the variable pos (the scanning loop) and either change the comment text to title
case or refactor the scan logic into a single function (e.g.,
scanForward/validateSequence) and replace the duplicated comments with a single
well-named function call.

In `@test/javascript/bitstring_test.mjs`:
- Around line 5267-5344: The two tests asserting truncation for 4-byte sequences
duplicate the same input ([0xf0, 0x90, 0x8d]) and the first test's comment is
inconsistent; remove the duplicate test (either the one titled "returns true for
truncated 4-byte sequence with valid continuation bytes" or "returns true for
4-byte sequence with 2 valid continuation bytes (truncated)") and keep a single
test that uses Bitstring.isTruncatedUtf8Sequence with bytes [0xf0, 0x90, 0x8d],
updating its comment to state "only 3 bytes available (2 valid continuation
bytes)" so the test name, comment and sample bytes are aligned.

assets/js/erlang/unicode.mjs

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

assets/js/bitstring.mjs (1)

255-269: Object literals firstByteMasks and minValueForLength are re-allocated on every call.

Both decodeUtf8CodePoint (line 259) and isValidUtf8CodePoint (line 602) construct a new plain-object lookup table on each invocation. These tables are constant and could be hoisted to module-level or class-level const values to avoid repeated allocation, especially since these methods are called inside tight loops during UTF-8 scanning.

♻️ Proposed refactor (illustrative — apply equivalent changes to both methods)

+const FIRST_BYTE_MASKS = {2: 0x1f, 3: 0x0f, 4: 0x07};
+const MIN_VALUE_FOR_LENGTH = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};

 static decodeUtf8CodePoint(bytes, start, length) {
   if (length === 1) return bytes[start];

-  const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-  let codePoint = bytes[start] & firstByteMasks[length];
+  let codePoint = bytes[start] & FIRST_BYTE_MASKS[length];
   ...
 }

 static isValidUtf8CodePoint(codePoint, encodingLength) {
-  const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-  if (codePoint < minValueForLength[encodingLength]) return false;
+  if (codePoint < MIN_VALUE_FOR_LENGTH[encodingLength]) return false;
   ...
 }

Also applies to: 596-612

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, Hoist the per-call lookup
objects into shared constants to avoid re-allocation: move the firstByteMasks
and minValueForLength objects used in decodeUtf8CodePoint and
isValidUtf8CodePoint out of the functions and define them once at module (or
class static) scope as const FIRST_BYTE_MASKS and MIN_VALUE_FOR_LENGTH, then
replace the in-function object literals with references to those constants;
ensure you update both decodeUtf8CodePoint and isValidUtf8CodePoint to use the
new constant names (and keep behavior identical).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 638-657: The function isTruncatedUtf8Sequence currently reads
bytes[start] without bounds checking, causing out-of-range starts to be treated
as a valid truncated UTF-8; add an explicit bounds guard at the top of
isTruncatedUtf8Sequence to immediately return false when start is outside the
buffer (start < 0 or start >= bytes.length) before calling
$.getUtf8SequenceLength or $.isValidUtf8ContinuationByte, keeping the rest of
the logic intact.

---

Duplicate comments:
In `@assets/js/erlang/unicode.mjs`:
- Around line 97-112: The function findValidUtf8Length is duplicated across the
five Unicode helper blocks; extract a single shared implementation (keep the
current behavior that returns a numeric pos and uses
Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence) into one
reusable helper (e.g., a top-level/exported findValidUtf8Length in the module)
and replace the other verbatim copies by calling that shared helper; ensure the
helper name and signature remain findValidUtf8Length(bytes) returning a number
so callers in the existing Unicode helper functions continue to work without
further changes.

---

Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-269: Hoist the per-call lookup objects into shared constants
to avoid re-allocation: move the firstByteMasks and minValueForLength objects
used in decodeUtf8CodePoint and isValidUtf8CodePoint out of the functions and
define them once at module (or class static) scope as const FIRST_BYTE_MASKS and
MIN_VALUE_FOR_LENGTH, then replace the in-function object literals with
references to those constants; ensure you update both decodeUtf8CodePoint and
isValidUtf8CodePoint to use the new constant names (and keep behavior
identical).

assets/js/bitstring.mjs

mward-sudo · 2026-02-19T00:29:32Z

@coderabbitai Don't do any further reviews on this PR while it is in draft.

coderabbitai · 2026-02-19T00:29:55Z

@mward-sudo: Understood! I won't perform any further reviews on this PR while it remains in draft. The automatic reviews are now paused. Feel free to mention me again when you're ready for a review after the PR is out of draft status.

✅ Actions performed

Reviews paused.

mward-sudo added 3 commits February 18, 2026 22:39

Extracts UTF-8 code point decoding to the BitString utility class, ad…

9a49bec

…ds parameter validation

Extracts UTF-8 continuation byte validation to BitString class

105d2d9

Extracts Utf-8 code point validation to BitString class

7015625

mward-sudo mentioned this pull request Feb 18, 2026

Extracts UTF-8 valid length position to Bitstring class #719

Draft

5 tasks

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

assets/js/erlang/unicode.mjs Show resolved Hide resolved

mward-sudo added 2 commits February 18, 2026 23:50

Extracts UTF-8 sequence validation to BitString class

b9bfe2d

Extracts truncated UTF-8 sequence validation to BitString class

9ab7bc3

mward-sudo force-pushed the 02-18-extracts_truncated_utf-8_sequence_validation_to_bitstring_class branch from 6d289bb to 9ab7bc3 Compare February 18, 2026 23:51

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

assets/js/bitstring.mjs Show resolved Hide resolved

This was referenced Feb 19, 2026

Extracts BitString to code points array to BitString class #721

Draft

Extracts BitString from code point to BitString class #723

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comments

Extracts truncated UTF-8 sequence validation to BitString class#718

Extracts truncated UTF-8 sequence validation to BitString class#718
mward-sudo wants to merge 5 commits intobartblast:devfrom
mward-sudo:02-18-extracts_truncated_utf-8_sequence_validation_to_bitstring_class

mward-sudo commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 18, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

mward-sudo commented Feb 19, 2026

Uh oh!

coderabbitai bot commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Comments

Conversation

mward-sudo commented Feb 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependencies

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mward-sudo commented Feb 19, 2026

Uh oh!

coderabbitai bot commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mward-sudo commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 18, 2026 •

edited

Loading